{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Choice of Metric in Machine Learning"
      ],
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "The metric distance measure is used for how two data are similar. For example, if the distance is small, it will be the high degree of similarity where large distance will be the low degree of similarity.\n",
        "\n",
        "Two main similarity:        \n",
        "If X = Y, similarity = 1 (where X & Y are two objects)      \n",
        "If X NOT = Y, similarity = 0"
      ],
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Euclidean distance - distance between two points in the plane in Euclidean space and with the distance, Euclidean space becomes a metric spaces. This most common use for distance."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "\n",
        "import warnings\n",
        "warnings.filterwarnings(\"ignore\")\n",
        "\n",
        "# fix_yahoo_finance is used to fetch data \n",
        "import fix_yahoo_finance as yf\n",
        "yf.pdr_override()"
      ],
      "outputs": [],
      "execution_count": 1,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# input\n",
        "symbol = 'AMD'\n",
        "start = '2014-01-01'\n",
        "end = '2018-08-27'\n",
        "\n",
        "# Read data \n",
        "dataset = yf.download(symbol,start,end)\n",
        "\n",
        "# Only keep close columns \n",
        "dataset.head()"
      ],
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[*********************100%***********************]  1 of 1 downloaded\n"
          ]
        },
        {
          "output_type": "execute_result",
          "execution_count": 2,
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Open</th>\n",
              "      <th>High</th>\n",
              "      <th>Low</th>\n",
              "      <th>Close</th>\n",
              "      <th>Adj Close</th>\n",
              "      <th>Volume</th>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>Date</th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "      <th></th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>2014-01-02</th>\n",
              "      <td>3.85</td>\n",
              "      <td>3.98</td>\n",
              "      <td>3.84</td>\n",
              "      <td>3.95</td>\n",
              "      <td>3.95</td>\n",
              "      <td>20548400</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2014-01-03</th>\n",
              "      <td>3.98</td>\n",
              "      <td>4.00</td>\n",
              "      <td>3.88</td>\n",
              "      <td>4.00</td>\n",
              "      <td>4.00</td>\n",
              "      <td>22887200</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2014-01-06</th>\n",
              "      <td>4.01</td>\n",
              "      <td>4.18</td>\n",
              "      <td>3.99</td>\n",
              "      <td>4.13</td>\n",
              "      <td>4.13</td>\n",
              "      <td>42398300</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2014-01-07</th>\n",
              "      <td>4.19</td>\n",
              "      <td>4.25</td>\n",
              "      <td>4.11</td>\n",
              "      <td>4.18</td>\n",
              "      <td>4.18</td>\n",
              "      <td>42932100</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2014-01-08</th>\n",
              "      <td>4.23</td>\n",
              "      <td>4.26</td>\n",
              "      <td>4.14</td>\n",
              "      <td>4.18</td>\n",
              "      <td>4.18</td>\n",
              "      <td>30678700</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "            Open  High   Low  Close  Adj Close    Volume\n",
              "Date                                                    \n",
              "2014-01-02  3.85  3.98  3.84   3.95       3.95  20548400\n",
              "2014-01-03  3.98  4.00  3.88   4.00       4.00  22887200\n",
              "2014-01-06  4.01  4.18  3.99   4.13       4.13  42398300\n",
              "2014-01-07  4.19  4.25  4.11   4.18       4.18  42932100\n",
              "2014-01-08  4.23  4.26  4.14   4.18       4.18  30678700"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 2,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "dataset.shape"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 3,
          "data": {
            "text/plain": [
              "(1172, 6)"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 3,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "X = dataset['Open'].values.reshape(1172,-1)\n",
        "Y = dataset['Adj Close'].values.reshape(1172,-1)"
      ],
      "outputs": [],
      "execution_count": 4,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.metrics.pairwise import euclidean_distances"
      ],
      "outputs": [],
      "execution_count": 5,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "euclidean_distances(X, X) # distance between rows of X"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 6,
          "data": {
            "text/plain": [
              "array([[ 0.      ,  0.13    ,  0.16    , ..., 17.340001, 19.06    ,\n",
              "        21.090001],\n",
              "       [ 0.13    ,  0.      ,  0.03    , ..., 17.210001, 18.93    ,\n",
              "        20.960001],\n",
              "       [ 0.16    ,  0.03    ,  0.      , ..., 17.180001, 18.9     ,\n",
              "        20.930001],\n",
              "       ...,\n",
              "       [17.340001, 17.210001, 17.180001, ...,  0.      ,  1.719999,\n",
              "         3.75    ],\n",
              "       [19.06    , 18.93    , 18.9     , ...,  1.719999,  0.      ,\n",
              "         2.030001],\n",
              "       [21.090001, 20.960001, 20.930001, ...,  3.75    ,  2.030001,\n",
              "         0.      ]])"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 6,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "euclidean_distances(X, Y) # get distance to origin"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 7,
          "data": {
            "text/plain": [
              "array([[1.0000000e-01, 1.5000000e-01, 2.8000000e-01, ..., 1.8440001e+01,\n",
              "        2.0130000e+01, 2.1410000e+01],\n",
              "       [3.0000000e-02, 2.0000000e-02, 1.5000000e-01, ..., 1.8310001e+01,\n",
              "        2.0000000e+01, 2.1280000e+01],\n",
              "       [6.0000000e-02, 1.0000000e-02, 1.2000000e-01, ..., 1.8280001e+01,\n",
              "        1.9970000e+01, 2.1250000e+01],\n",
              "       ...,\n",
              "       [1.7240001e+01, 1.7190001e+01, 1.7060001e+01, ..., 1.1000000e+00,\n",
              "        2.7899990e+00, 4.0699990e+00],\n",
              "       [1.8960000e+01, 1.8910000e+01, 1.8780000e+01, ..., 6.1999900e-01,\n",
              "        1.0700000e+00, 2.3500000e+00],\n",
              "       [2.0990001e+01, 2.0940001e+01, 2.0810001e+01, ..., 2.6500000e+00,\n",
              "        9.6000100e-01, 3.1999900e-01]])"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 7,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from scipy.spatial import distance"
      ],
      "outputs": [],
      "execution_count": 8,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "distance.euclidean(X, Y)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 9,
          "data": {
            "text/plain": [
              "8.051006457582059"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 9,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.neighbors import DistanceMetric"
      ],
      "outputs": [],
      "execution_count": 10,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "dist = DistanceMetric.get_metric('euclidean')\n",
        "dist.pairwise(X)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 11,
          "data": {
            "text/plain": [
              "array([[ 0.      ,  0.13    ,  0.16    , ..., 17.340001, 19.06    ,\n",
              "        21.090001],\n",
              "       [ 0.13    ,  0.      ,  0.03    , ..., 17.210001, 18.93    ,\n",
              "        20.960001],\n",
              "       [ 0.16    ,  0.03    ,  0.      , ..., 17.180001, 18.9     ,\n",
              "        20.930001],\n",
              "       ...,\n",
              "       [17.340001, 17.210001, 17.180001, ...,  0.      ,  1.719999,\n",
              "         3.75    ],\n",
              "       [19.06    , 18.93    , 18.9     , ...,  1.719999,  0.      ,\n",
              "         2.030001],\n",
              "       [21.090001, 20.960001, 20.930001, ...,  3.75    ,  2.030001,\n",
              "         0.      ]])"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 11,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Squared euclidean distance - two data points involves computing the square root of the sum of the squares of the differences between corresponding values. "
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# scipy.spatial.distance.pdist\n",
        "from scipy.spatial.distance import pdist"
      ],
      "outputs": [],
      "execution_count": 12,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "squared_eculidean = pdist(X, metric='euclidean')\n",
        "squared_eculidean"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 13,
          "data": {
            "text/plain": [
              "array([0.13    , 0.16    , 0.34    , ..., 1.719999, 3.75    , 2.030001])"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 13,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Manhattan distance - is a point and a line. The distance between a point and a line is defined as the smallest distance between any point on the line."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from scipy.spatial import distance"
      ],
      "outputs": [],
      "execution_count": 14,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "distance.cityblock(X, Y)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 15,
          "data": {
            "text/plain": [
              "170.850004"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 15,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.metrics.pairwise import manhattan_distances"
      ],
      "outputs": [],
      "execution_count": 16,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "manhattan_distances(X, Y, sum_over_features=False)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 17,
          "data": {
            "text/plain": [
              "array([[0.1     ],\n",
              "       [0.15    ],\n",
              "       [0.28    ],\n",
              "       ...,\n",
              "       [2.65    ],\n",
              "       [0.960001],\n",
              "       [0.319999]])"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 17,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Maximum distance (Chebyshev distance) - is a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from scipy.spatial import distance"
      ],
      "outputs": [],
      "execution_count": 18,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "distance.chebyshev(X,Y)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 19,
          "data": {
            "text/plain": [
              "1.4100000000000001"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 19,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.neighbors import DistanceMetric\n",
        "\n",
        "dist = DistanceMetric.get_metric('chebyshev')\n",
        "dist"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 20,
          "data": {
            "text/plain": [
              "<sklearn.neighbors.dist_metrics.ChebyshevDistance at 0x1cbbccef518>"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 20,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "dist.pairwise(X)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 21,
          "data": {
            "text/plain": [
              "array([[ 0.      ,  0.13    ,  0.16    , ..., 17.340001, 19.06    ,\n",
              "        21.090001],\n",
              "       [ 0.13    ,  0.      ,  0.03    , ..., 17.210001, 18.93    ,\n",
              "        20.960001],\n",
              "       [ 0.16    ,  0.03    ,  0.      , ..., 17.180001, 18.9     ,\n",
              "        20.930001],\n",
              "       ...,\n",
              "       [17.340001, 17.210001, 17.180001, ...,  0.      ,  1.719999,\n",
              "         3.75    ],\n",
              "       [19.06    , 18.93    , 18.9     , ...,  1.719999,  0.      ,\n",
              "         2.030001],\n",
              "       [21.090001, 20.960001, 20.930001, ...,  3.75    ,  2.030001,\n",
              "         0.      ]])"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 21,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Mahalanobis distance - is a mesurement of the distance between a point P and a distribution D. \n",
        "\nhttps://www.statisticshowto.datasciencecentral.com/mahalanobis-distance/"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.neighbors import DistanceMetric"
      ],
      "outputs": [],
      "execution_count": 22,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "dist = DistanceMetric.get_metric('mahalanobis', V=np.cov(X))\n",
        "dist"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 23,
          "data": {
            "text/plain": [
              "<sklearn.neighbors.dist_metrics.MahalanobisDistance at 0x1cbbccef588>"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 23,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "dist.rdist_to_dist(X)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 24,
          "data": {
            "text/plain": [
              "array([[1.96214169],\n",
              "       [1.99499373],\n",
              "       [2.00249844],\n",
              "       ...,\n",
              "       [4.60325982],\n",
              "       [4.78643918],\n",
              "       [4.9939965 ]])"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 24,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "dist.dist_to_rdist(X)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 25,
          "data": {
            "text/plain": [
              "array([[ 14.8225    ],\n",
              "       [ 15.8404    ],\n",
              "       [ 16.0801    ],\n",
              "       ...,\n",
              "       [449.01614238],\n",
              "       [524.8681    ],\n",
              "       [622.00364988]])"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 25,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Jaccard Metric"
      ],
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "Jaccard Metrics is similar to coefficient score"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from scipy.spatial import distance\n",
        "\ndistance.jaccard(X, Y)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 27,
          "data": {
            "text/plain": [
              "0.96160409556314"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 27,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from sklearn.metrics import jaccard_similarity_score\n",
        "\n",
        "X = dataset['Open'].astype(int)\n",
        "Y = dataset['Adj Close'].astype(int)\n",
        "jaccard_similarity_score(X, Y)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 29,
          "data": {
            "text/plain": [
              "0.8447098976109215"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 29,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    },
    {
      "cell_type": "code",
      "source": [
        "jaccard_similarity_score(X, Y, normalize=False)"
      ],
      "outputs": [
        {
          "output_type": "execute_result",
          "execution_count": 30,
          "data": {
            "text/plain": [
              "990"
            ]
          },
          "metadata": {}
        }
      ],
      "execution_count": 30,
      "metadata": {
        "collapsed": false,
        "outputHidden": false,
        "inputHidden": false
      }
    }
  ],
  "metadata": {
    "kernel_info": {
      "name": "python3"
    },
    "language_info": {
      "pygments_lexer": "ipython3",
      "file_extension": ".py",
      "nbconvert_exporter": "python",
      "mimetype": "text/x-python",
      "name": "python",
      "version": "3.5.5",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      }
    },
    "kernelspec": {
      "name": "python3",
      "language": "python",
      "display_name": "Python 3"
    },
    "nteract": {
      "version": "0.12.2"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}